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Abstract 

Thermal  management  in  microprocessors  has  become  a 
major  design  challenge  in  recent  years.  Thermal  monitoring 
through  hardware  sensors  is  important ,  and  these  sensors 
must  be  carefully  placed  on  the  chip  to  account  for  thermal 
gradients.  In  this  paper,  we  present  an  analytical  model  that 
describes  the  maximum  temperature  differential  between  a 
hot  spot  and  a  region  of  interest  based  on  their  distance 
and  processor  packaging  information.  We  also  use  a  run¬ 
time  thermal  model,  as  an  illustration  of  virtual  sensors, 
and  examine  two  benchmarks  that  exhibit  highly  concen¬ 
trated  thermal  stress.  We  then  use  our  analytical  model  to 
demonstrate  the  safety  margins  of  the  chip.  Ultimately,  the 
mathematical  expression  allows  designers  to  obtain  worst- 
case  behavior  of  thermal  heatup  and  select  the  optimal  lo¬ 
cation  of  additional  sensors. 


1.  Introduction 

As  an  effort  to  reduce  costs  associated  with  increasing 
heat  dissipation  in  processors,  researchers  have  developed 
various  dynamic  thermal  management  (DTM)  techniques. 
Ideally,  a  chip's  temperature  would  be  monitored  at  a  fine 
granularity  using  many  sensors.  However,  most  sensors  are 
based  on  analog  CMOS  circuit  designs.  Precise  temperature 
measurement  typically  requires  matched  transistors,  and  as 
process  variations  grow  in  severity,  accurate  sensors  may 
require  large  device  sizes  to  compensate.  This  in  turn  in¬ 
creases  their  size  and  power  requirements.  On-chip  sen¬ 
sors  can  also  be  difficult  to  calibrate,  and  adding  sensors 
may  therefore  increase  testing  costs.  To  date,  CPU  designers 
have  used  at  most  a  few  sensors.  (One  notable  exception  is 
the  recently  announced  IBM  POWER5  [1],  with  24  on-chip 
sensors.)  The  limited  number  of  sensors  makes  their  place¬ 
ment  on  the  chip  important  for  reliability.  It  might  become 
possible  to  create  a  program  that  targets  a  unit  far  away  from 
the  sensor  for  overheating.  The  localized  overheating  may 
not  be  visible  to  distant  sensors,  and  cause  permanent  dam¬ 


age  to  the  chip.  The  choice  of  thermal  safety  margins  will 
also  dictate  sensor  number  and  placement;  smaller  margins 
are  less  wasteful  but  require  more  sensors. 

This  paper  proposes  a  framework  to  examine  the  max¬ 
imum  temperature  gradient  in  processors  and  place  sen¬ 
sors  accordingly  to  account  for  cross-chip  temperature  vari¬ 
ations.  Virtual  sensors,  which  are  modeled  in  software,  are 
explored  as  an  alternative  to  hardware  sensors.  We  also 
present  a  simple  case  study  that  examines  thermal  security 
risks  and  implications  for  sensor  placement. 

The  paper  is  structured  as  follows.  Section  2  gives  an 
overview  of  the  expression.  Section  3  explains  the  use  of 
virtual  sensors.  Section  4  presents  a  case  study  using  the  ex¬ 
pression  to  examine  thermal  security  risks.  Section  5  con¬ 
cludes  the  paper. 

2.  Modeling  Temperature  Differentials 
2.1.  Deriving  the  Equation 

Understanding  the  various  hot  spots  on  the  chip  is  one  of 
the  most  important  factors  to  consider  when  placing  a  hard¬ 
ware  sensor  on  a  chip.  Hot  regions  can  be  effectively  located 
through  simulations  or  runtime  measurements.  But  Lee  and 
Skadron  [5]  have  also  demonstrated  that  hot  spots  can  move 
around  the  chip  during  the  execution  of  the  program.  Hence, 
locating  a  sensor  near  one  well  known  hot  spot  is  often  not 
enough  to  ensure  that  that  is  the  maximum  temperature  on 
the  chip.  Another  important  factor  to  consider  is  the  max¬ 
imum  possible  temperature  differential  between  a  hot  spot 
and  a  potential  sensor  location.  This  information  can  then 
be  used  to  determine  various  safety  margins  and  the  most 
effective  spacing  between  hardware  sensors  as  well  as  per¬ 
haps  the  optimal  number  of  sensors. 

The  following  expression  describes  the  maximum  tem¬ 
perature  differential  from  a  given  location.  The  temperature 
at  the  hot  spot  (source)  -where  r  =  0-  is  the  highest.  At  a 
point  of  interest,  located  at  a  distance  r  from  the  hot  spot, 
the  temperature  difference  between  the  two  points  is  given 
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by  Tsrc(r);  i.e.,  the  temperature  at  r  is  Tsrc{r)  degrees  cooler 
than  that  at  the  source. 

Tsrcif)  =  Tsrc—ma x  X  ^1  —  e  K  ^  (1) 

The  constant  K  denotes  the  thickness  of  the  processor 
package  -die,  heat  spreader,  and  thermal  interface  material- 
in  terms  of  silicon.  The  thickness  of  each  packaging  mate¬ 
rial  is  multiplied  by  a  factor  based  on  the  thermal  resistivity 
of  the  material  and  that  of  silicon.  After  obtaining  the  equiv¬ 
alent  thickness  of  the  material  in  terms  of  silicon,  these  val¬ 
ues  are  summed  to  obtain  the  value  of  K.  Tmax  can  be  de¬ 
rived  for  two  different  cases:  one  where  there  is  a  single 
power  source,  and  the  other  where  there  are  multiple  power 
sources.  When  there  is  a  single  power  source,  the  tempera¬ 
ture  drops  off  exponentially  as  the  distance  from  the  source 
increases.  When  the  source  is  at  its  local  maximum  power 
density,  the  value  of  Tmax  can  then  be  derived  as  the  dif¬ 
ference  between  the  maximum  and  minimum  temperature 
value. 

However,  there  is  limited  use  for  the  equation  for  a  single 
power  source.  In  practice,  each  functional  unit  can  be  mod¬ 
eled  as  a  power  source,  and  many  of  these  will  be  active 
simultaneously.  The  complexity  of  having  multiple  power 
sources  interacting  does  not  easily  permit  us  to  find  an  ex¬ 
act  equation  to  fit  the  temperature-distance  curve.  Nonethe¬ 
less,  we  can  derive  Tmax  to  be  an  upper  limit.  The  effects  of 
different  power  sources  are  assumed  to  be  independent  of 
each  other  and  follow  a  linear  relation  allowing  superposi¬ 
tion.  Thus,  for  multiple  power  sources,  Tsrc-max  can  be  de¬ 
rived  as  follows: 


2.2.  Using  the  Temperature  Curve 

In  this  section  we  explain  two  possible  ways  to  use  the 
temperature  curve,  both  illustrated  in  Figure  1.  First,  the 
curve  tells  us  the  maximum  temperature  differential  be¬ 
tween  a  hot  spot  and  a  thermal  sensor.  For  example,  let  the 
temperature  at  the  hottest  spot  -r  =  0-  be  Tq.  Assume  that 
the  sensor  is  located  at  a  distance  of  rs.  The  curve  will  in¬ 
dicate  that  the  maximum  possible  temperature  difference  is, 
for  example,  7a  =  7’v„  (rs).  Thus,  if  the  the  sensor  reads  a 
temperature  value  of  Ts,  then  Tq  —  Ts<T\,  or  To  <  Ts  +  T\. 
If  the  hottest  spot  starts  to  heat  up  rapidly,  as  in  a  thermal 
virus,  ?a  indicates  the  maximum  temperature  gradient  be¬ 
tween  the  two  points  no  matter  how  high  the  temperature 
rises. 


Ta  = 

r<R_s) 

a> 

o 

c 

<D 

Q 

a> 

3 

m 

a> 

a 

E 

|0> 

Delta-T 

R_s 

ensor  distance  from  source  (hot  spot) 

Figure  1.  Maximum  temperature  differential 


T'src—max —  ^  (7/  max  fa  X  Tfrsrc—i ) )  (2) 

iGUnits 

The  Ti^max  contribution  for  each  different  unit  is 
summed  up,  then  subtracted  by  a  contribution  equal  to  the 
temperature  that  each  unit  adds  to  the  source.  Since  the  7) 
values  are  derived  from  maximum  power  cases,  the  lat¬ 
ter  part  is  multiplied  by  an  activity  factor  in  the  range  of  0.0 
to  1.0.  If  fa  equals  1.0,  then  all  the  power  sources  are  active 
and  outputting  maximum  power.  Hence,  Tsrc-max  is  min¬ 
imized  and  the  difference  between  different  heat  sources 
is  smaller.  This  is  consistent  with  our  intuition,  since  the 
temperature  difference  between  two  points  would  actu¬ 
ally  be  smaller  if  both  points  were  acting  as  a  power  source. 
A  smaller  fa  value  indicates  that  most  units  are  less  ac¬ 
tive  while  one  or  more  units  has  the  potential  to  be  highly 
active.  This  results  in  a  larger  Tsrc-max  value,  which  im¬ 
plies  that  a  larger  temperature  gradient  is  possible.  A  ther¬ 
mal  stressmark  or  virus  takes  this  behavior  to  an  extreme, 
creating  a  very  hot  region  potentially  far  away  from  the  sen¬ 
sor.  Section  4  explains  this  behavior  and  how  to  use  the 
equation  in  more  detail. 


In  addition,  if  a  sensor  is  already  placed  near  a  hot  spot, 
the  temperature  curve  can  be  used  to  place  a  second  sensor 
so  that  the  temperature  difference  between  two  them  cannot 
be  any  larger  AT.  Solving  for  rsensor  in  equation(  1)  gives: 

rse„sor  =  0.5.  K.  In  (T  TsK-Z\r)  (3) 

\  isrc—max  J 

3.  Virtual  Sensors 

Placing  as  many  sensors  as  possible  would  be  ideal,  how¬ 
ever,  design  constraints  may  limit  the  number  of  sensors  that 
can  be  implemented  on  a  chip.  Nonetheless,  not  all  sen¬ 
sors  have  to  be  physical  CMOS  circuits.  Virtual  sensors  - 
modeled  through  software-  can  be  created  to  supplement 
the  existing  hardware  sensors.  The  equations  described  in 
Section  2  do  not  distinguish  between  physical  sensors  and 
virtual  sensors. 

Skadron  et  al.  proposed  a  thermal  model  called  HotSpot , 
which  computes  the  temperature  for  each  microarchitec¬ 
ture  block  on  the  chip  [6].  In  prior  work,  we  have  extended 
HotSpot  to  interface  with  performance  counters  [5]  based 


Proceedings  of  the  2005  International  Conference  on  Computer  Design  (ICCD'05) 
0-7695-2451-6/05  $20.00  ©  2005  IEEE 


Computer 

SOCIETY 


on  a  power  model  by  Isci  and  Martonosi  [4],  This  creates  a 
dynamic  online  model  that  estimates  temperature  readings 
based  on  real  workload  at  runtime.  This  creates  virtual  sen¬ 
sors,  one  for  each  functional  unit. 

One  important  area  for  future  work  is  to  develop  a  very 
simple  thermal  computation  based  on  performance  counters 
to  model  virtual  sensors.  While  HotSpot  has  the  advantage 
of  providing  a  very  detailed  thermal  profile  of  the  proces¬ 
sor,  it  may  be  desirable  to  trade  off  some  accuracy  to  obtain 
a  radically  simpler  algorithm. 

4.  Case  Study 

4.1.  BPU  Stressmarks 

The  main  purpose  of  this  section  is  to  demonstrate  the 
use  of  our  analytical  expression  as  a  sensible  method  for 
examining  sensor  placement.  This  case  study  examines  two 
artificial  benchmarks  that  were  created  to  emulate  the  char¬ 
acteristics  of  thermal  viruses;  highly  concentrated  thermal 
stress  in  one  unit,  and  minimal  activity  in  other  regions. 
We  use  the  extended  HotSpot  model  as  virtual  sensors  [5]. 
Although  the  extended  Pentium  4  HotSpot  model  has  not 
yet  been  fully  validated  for  accuracy,  the  resulting  thermal 
stress  patterns  nevertheless  provide  valuable  information  to 
demonstrate  our  equation  for  effective  sensor  placement  and 
for  studying  localized  heating  effects  on  the  chip. 

Both  benchmarks  attempt  to  stress  the  branch  prediction 
units.  The  first  experiment  uses  a  program  that  contains  ap¬ 
proximately  90  if-statements  within  a  large  loop.  During  the 
program’s  peak  execution,  the  results  show  that  the  temper¬ 
ature  of  the  L2J3PU  increases  by  roughly  1 2"C.  While  the 
L2_BPU  does  heat  up  more  than  usual,  the  integer  units 
still  remain  the  hottest  units  on  the  chip.  This  is  consis¬ 
tent  with  most  applications;  the  hot  spots  are  located  in  the 
lower  region  of  the  chip  where  the  integer  units  and  the  re¬ 
name/queue/scheduling  units  are  located. 


Figure  2.  BPU  stressmark  #2 


The  second  benchmark  slightly  modifies  the  first  bench¬ 
mark  program.  Within  the  compiled  assembly  code  of  the 
first  program,  nearly  all  cmpl  and  movl  instructions  are  re¬ 
moved,  thus  creating  a  series  of  conditional  branch  instruc¬ 
tions  -jg  or  jle-  inside  the  main  loop.  The  results  show  that 
the  temperature  of  the  L2_BPU  increases  above  that  of  the 
integer  execution  core.  Figure  2  shows  the  thermal  map  at 
a  particular  instance  in  time.  Exact  temperatures  are  uncer¬ 
tain  due  to  uncertainties  in  our  model. 

4.2.  Analysis 

Intel  has  carefully  considered  every  possible  thermal 
gradient  in  its  sensor  placement  and  design  margin.  The 
Pentium  4  also  has  advanced  thermal  management  tech¬ 
niques  [2,  3].  We  use  the  Pentium  4  in  our  study  not  be¬ 
cause  we  can  induce  damage,  but  believe  it  is  safe  to  exper¬ 
iment  with  given  all  its  built-in  thermal  protection.  Despite 
the  experimental  limitations  of  the  BPU  stressmark  char¬ 
acterization,  it  gives  us  a  hypothetical  scenario  in  which  a 
unit  heats  up  remote  from  any  sensor  while  the  units  near 
the  sensor  remain  colder.  This  lets  us  test  our  analysis  us¬ 
ing  the  framework  developed  in  Section  2. 

In  order  to  find  the  Tsrc-max  when  there  are  multiple 
power  source,  we  need  to  first  find  the  single-source  Tmax 
value  for  every  unit  of  the  Pentium  4  processor.  For  each 
unit,  we  artificially  assign  each  unit’s  local  maximum  power 
and  assign  zero  power  to  all  other  units.  We  then  obtain 
the  steady-state  temperature  readings  and  calculate  Tmax. 
Since  the  floorplan  dimensions  are  known,  we  can  plot  this 
temperature-distance  relation.  The  distance  is  approximated 
as  the  distance  between  the  center  of  two  units. 

To  analyze  the  BPU  stressmarks  in  Figure  2,  note  that 
the  IntReg  and  the  L2J3PU  are  the  hottest  units  on  the  chip 
respectively.  Table  1  shows  the  results  for  all  of  the  key  pa¬ 
rameters  derived  using  equation(  2) .  The  activity  factor  was 
derived  using  the  two  BPU  stressmarks  in  Section  4. 1 .  This 
factor  could  be  adjusted  for  average  or  worst-case  scenar¬ 
ios. 


K  (mm) 

fa 

Tmax{multi) 

Tmax(single) 

L2 BPU 

2.69 

0.4 

36.37  °C 

10.9  °C 

IntReg 

2.69 

0.5 

21.11  °C 

1.34  °C 

Table  1 .  Parameters  for  the  temperature  equa¬ 
tion 

Using  these  data,  we  can  obtain  the  equations  that  de¬ 
scribe  the  maximum  temperature  differential  when  the  In¬ 
tReg  and  the  L2_BPU  are  the  hottest  units  respectively.  Fig¬ 
ure  3  shows  these  curves  as  well  as  the  actual  observed  tem¬ 
perature  data.  The  actual  data  is  taken  from  the  same  tem¬ 
perature  sample  seen  in  Figure  2.  When  IntReg  is  the  hottest 
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unit  (not  shown  in  Figure  2),  all  the  data  points  lie  under  the 
Tjreg(r)  curve.  In  Figure  2,  L2_BPU  is  the  hottest  unit,  and 
all  the  data  points  in  Figure  3  fit  well  within  the  7j,p„(r) 
curve. 


Figure  3.  Benchmark  Temperature  Curves 

If  most  functional  units  are  below  normal  operating  tem¬ 
peratures,  heating  could  be  too  isolated  and  the  thermal 
monitor  may  not  operate  or  may  respond  too  late  when  the 
chip  has  already  been  permanently  damaged.  The  results  in 
Section  4. 1  show  early  work  indicating  the  conceptual  pos¬ 
sibility  of  thermal  security  attacks  through  thermal  viruses. 
However,  as  seen  in  Figure  3,  these  benchmarks  do  not  ex¬ 
hibit  thermal  behavior  near  the  thermal  limits  described  by 
the  temperature  curves.  The  analysis  further  supports  the 
fact  that  the  Pentium  4  is  not  vulnerable  in  our  case  study. 

Hardware  designers  can  gain  a  better  understanding  of 
the  thermal  gradient  bounds  using  the  analytical  method 
presented  in  this  paper.  By  effectively  identifying  thermal 
safety  margins  through  average  or  worst-case  thermal  be¬ 
havior,  designers  may  be  able  to  recognize  the  need  for 
more  sensors  as  well  as  improve  sensor  placement  on  the 
chip.  Furthermore,  not  all  processors  have  on-chip  sensors, 
or  place  them  in  the  package  but  not  in  the  chip  itself.  Those 
chips  may  be  vulnerable  to  thermal  attacks  unless  they  use 
very  large  design  margins. 

5.  Conclusions 

Uneven  activity  associated  with  each  functional  unit 
within  a  processor  results  in  localized  hot  spots.  If  this  is 
not  factored  into  the  design  or  monitored  by  sensors  -real 
or  virtual-  then  reliable  operation  may  be  at  risk.  Thus,  it  is 
important  to  be  able  to  efficiently  place  hardware  sensors  to 
monitor  the  temperature  of  the  processor.  The  equation  de¬ 
scribed  in  Section  2  provides  a  qualitative  limit  in  terms  of 
the  thermal  gradient  that  is  achievable. 

Ultimately,  hardware  sensors  and  virtual  sensors  can  be 
used  in  conjunction  to  provide  detailed  information  about 
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the  thermal  distribution  of  the  processor.  Design  constraints, 
including  cost,  may  limit  the  number  of  hardware  sensors. 
Creating  efficient  algorithms  for  virtual  sensors  is  an  impor¬ 
tant  area  of  future  work.  Our  analytical  method  can  guide 
designers  to  effectively  use  such  sensors  to  provide  more 
thermal  protection.  Our  work  illustrates  the  need  for  on-chip 
sensors  and  their  careful  placement  to  avoid  thermal  risk. 
While  we  used  the  Pentium  4  in  our  case  study,  its  thermal 
design  is  robust.  The  POWER5  is  even  more  aggressive, 
with  24  on-chip  sensors  [1].  Chips  without  on-chip  sensors 
are  either  incurring  excessive  design  margin  or  may  be  vul¬ 
nerable  to  thermal  attacks. 
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